ABSTRACT

In order to estimate the classical coefficient of
test reliability, parallel measurements are needed. H. Gulliksen's 
matched random subtests method, which is a graphical method for 
splitting a test into parallel test halves, has practical relevance 
because it maximizes the alpha coefficient as a lower bound of the 
classical test reliability coefficient. This paper formulates this 
same problem as a zero-one programming problem, the advantage being 
that it can be solved by algorithms already existing in computer 
code. Focus is on giving Gulliksen's method a sound computational 
basis. How the procedure can be generalized to test splits into 
components of any length is shown. An empirical illustration of the 
procedure is provided, which involves the use of the algorithm 
developed by A. H. Land and A. Doig (1960), as implemented in the 
LANDO program. Item difficulties and item-test correlations were
estimated from a sample of 5,418 subjects — a sample size that is 
large enough to prevent capitalizing on chance in the Gulliksen 
method. Two data tables and one graph are provided. (Author/TJH)
Abstract

Gulliksen's matched random subtests method is a graphical method to
split a test into parallel test halves which has practical
relevance because it maximizes coefficient alpha as a lower bound to
the classical test reliability coefficient. In the paper the same
problem is formulated as a zero-one programming problem, the
advantage being that it can be solved by algorithms already
existing in computer code. It is shown how the procedure can be
generalized to test splits into components of any length. An
empirical illustration of the procedure concludes the paper.
A Zero-one Programming Approach to
Gulliksen's Matched Random Subtests Method

In order to estimate the classical coefficient of test reliability, 
parallel measurements are needed. Methods proposed to meet this 
requirement in practice are retesting the same subjects with the 
same test after some time has elapsed or carefully constructing a 
parallel test and testing the same subjects with both instruments. 
As is known from practical experience, though, these methods do not
work well. The main objection against the test-retest method is
that replicate test administrations are impossible with living
subjects, who may exhibit all kinds of interfering processes such as
remembering earlier administrations, learning and forgetting
between administrations, or taking a dislike to another
administration. The parallel-forms method, in fact, constitutes a
dilemma. It assumes that it is possible to construct two different 
tests with exactly the same measurement properties. Practical 
experience shows that this ideal may be attained to some extent but 
is never realized exactly. 

As a possible way out of this fundamental problem, Kuder and
Richardson (1937) proposed their formulas 20 and 21 which can be
estimated using (dichotomous) item and test scores from a single
administration. A generalization of these formulas to
non-dichotomous items or test components of any length is known as
Cronbach's (1951) coefficient $\alpha$:

(1)  $\alpha = \frac{n}{n-1}\left[1 - \frac{\sum_{g=1}^{n}\sigma^2(Y_g)}{\sigma^2(X)}\right], \quad n > 1,$



where $\sigma^2(Y_g)$ is the variance of the scores $Y_g$ on test
component g, $\sigma^2(X)$ the variance of the score X, and n the
number of components. The usual choices of test components in this
internal-consistency method are the individual test items or test
halves. Estimates of the test reliability based on the latter are
known as split-halves estimates. A generalization of (1) to any
split was introduced by Raju (1977) and is known as coefficient
$\beta$.
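As a minimal sketch of how (1) could be computed from observed component scores, consider the following. The function name `coefficient_alpha`, the data layout (one list of subject scores per component), and the use of population variances are assumptions made for illustration only.

```python
from statistics import pvariance


def coefficient_alpha(component_scores):
    """Coefficient alpha as in (1).

    component_scores: list of n lists, one per test component, each
    holding one score per subject (same subject order in every list).
    """
    n = len(component_scores)
    if n < 2:
        raise ValueError("alpha requires n > 1 components")
    # Total score X of each subject is the sum over the components.
    totals = [sum(subject) for subject in zip(*component_scores)]
    var_total = pvariance(totals)
    sum_component_var = sum(pvariance(c) for c in component_scores)
    return n / (n - 1) * (1 - sum_component_var / var_total)


# Two test halves scored for four subjects (illustrative numbers).
half_1 = [1, 2, 3, 4]
half_2 = [2, 1, 4, 3]
alpha = coefficient_alpha([half_1, half_2])
```

With test halves as components (n = 2), the value returned is the split-half estimate for that particular split; a different split of the same items will in general give a different lower bound.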

Analysis of the relationship of (1) to the definition of the
reliability coefficient reveals that they are equal to each other
only if the test components are essentially $\tau$-equivalent;
otherwise (1) is a lower bound to the test reliability (e.g., Lord
& Novick, 1968, sect. 4.4). Although this requirement is less
restrictive than the one of parallel measurements, it seems to give
rise to the same practical problems as for the test-retest and
parallel-forms methods. However, there is a possibility of
optimization that the latter methods do not possess. Since (1) is a
lower bound to the reliability for any split of the test into
components, and these bounds are not necessarily equal, we may look
for the split with the greatest lower bound and base our estimation
of the reliability coefficient on this. It is for this reason that
the internal-consistency method has not only a practical but also
some theoretical appeal.

Gulliksen (1950) proposed a method for splitting tests optimally
into halves which is known as the matched random subtests method.
The method is the only one available for this important purpose and
is described in most textbooks on test theory (e.g., Allen & Yen,
1979, sect. 4.4). In spite of this, it has not been implemented in
standard computer packages for test analysis and is hardly used on
a routine basis, the reason being that the method is graphic and
must be performed by hand. It is the purpose of this paper to
present a version of Gulliksen's method that is derived from
zero-one programming. Algorithms for this method exist and are
amply available in computer code. In the remainder of this paper,
first Gulliksen's method is outlined. Then, a zero-one programming
version of the method is proposed. Next, this version is
illustrated using empirical test data. The paper ends with some
concluding remarks and recommendations.

Gulliksen's Matched Random Subtests Method

Gulliksen's method is usually formulated for dichotomous item
scores but can easily be generalized to other situations. For
dichotomous item scores, the method involves two parameters for
each item, its difficulty and discriminating power.
Let $\pi_i$ and $\rho_{iX}$ denote the classical definitions of
these parameters. Then the former is the expected item score and
the latter the point biserial correlation between the item and the
test score. Each item is plotted on a graph with its values for the
two parameters as coordinates. Next, pairs of items are formed, the
criterion being that each pair should have points on the graph as
close to each other as possible. Test halves are obtained by
assigning one randomly chosen item from each pair to one test half
and the remaining items to the other.
Figure 1 shows a typical Gulliksen plot. The points are estimates
for a 20-item version of a mathematics achievement test used in the
Second Mathematics Study of the International Association for the
Evaluation of Education based on a Dutch sample of 5418 subjects.

Insert Figure 1 about here

The same data will be used in the empirical example below. Note
that some pairs in Figure 1 are obvious. Others, however, are not.
Item 16, for instance, could be paired with 19 but this choice has
consequences for the possibilities of 8, whereas the choice for
this item, in turn, restricts the possibilities for 2, and so on.
In fact, it is the absence of a clear-cut criterion for coping with
such dependencies that may make the method hard to use for larger
sets of items.

Let $Y_g$ in (1) represent the observed score on test half g which
consists of $n_g$ items (g = 1, 2). A well-known result from
classical test theory is that, for dichotomous items, the expected
values and variances of $Y_g$ can be written as functions of
$\pi_i$ and $\rho_{iY_g}$. Assuming $\rho_{iY_g} = \rho_{iX}$ for
g = 1, 2, as is implicitly done in the Gulliksen method, the
expressions are

(2)  $\mu_{Y_g} = \sum_{i=1}^{n_g} \pi_i$

(3)  $\sigma^2_{Y_g} = \left[ \sum_{i=1}^{n_g} [\pi_i(1-\pi_i)]^{1/2} \rho_{iX} \right]^2$

Gulliksen's method is motivated by the fact that pairwise matching
of the items on $\pi_i$ ensures that $\mu_{Y_1}$ and $\mu_{Y_2}$
are approximately equal. Hence, a necessary condition for the two
halves to have the same true scores is met. As matching on
$\rho_{iX}$ also ensures approximately equal values of (3) for
g = 1, 2, the two halves may have equal error variances and meet
the requirements of parallel measurements.
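The expressions for the mean and variance of a test half can be sketched in code as follows. The function names are hypothetical, and the sketch assumes the standard classical result that a dichotomous item has standard deviation $[\pi_i(1-\pi_i)]^{1/2}$, so that the standard deviation of the half is the sum of the products of item standard deviations and item-test correlations.

```python
from math import sqrt


def half_mean(pi):
    """Expected score of a test half: the sum of its item difficulties."""
    return sum(pi)


def half_sd(pi, rho):
    """Standard deviation of a test half.

    pi:  item difficulties of the items in the half
    rho: item-test correlations of the same items, in the same order
    """
    return sum(sqrt(p * (1 - p)) * r for p, r in zip(pi, rho))


def half_variance(pi, rho):
    """Variance of a test half: the squared standard deviation."""
    return half_sd(pi, rho) ** 2


# Two items with difficulty .5 and correlations .4 and .6 (made-up values).
mean = half_mean([0.5, 0.5])
variance = half_variance([0.5, 0.5], [0.4, 0.6])
```

Two halves whose items are matched on both parameters give nearly the same values for both functions, which is the rationale for the pairwise matching described above.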

As already mentioned, Gulliksen's method is graphic. It supposes
the presence of a judge inspecting the graph and pairing the items.
It is not an algorithm in the sense that all its rules can be
written in computer code. As illustrated earlier, its criterion for
pairing the items is not unequivocal. Therefore, situations may
arise where the judge does not know with certainty which of the
possible pairs to choose. Also, the random assignment of items from
pairs to test halves may be suboptimal. In particular, when the
items within pairs are not close to each other there is a
non-negligible probability for random assignment to result in test
halves being less parallel than necessary. Another desirable
improvement on the method would be an algorithm equally well
applicable to splits into other components than test halves. Splits
of tests into thirds or quarters, for instance, require the
division of the plots into triples or quadruples of items. It is
unlikely that this can be done satisfactorily for larger tests just
by inspecting plots. On the other hand, since such splits also
yield values for (1) that are lower bounds to the reliability
coefficient, it seems unwise to confine the search for the
greatest lower bound only to the subset of splits into test halves.
As a final comment on the Gulliksen method, it is noted that, like
any other method of item selection, the danger of chance
capitalization may arise if it is used with sample statistics
instead of parameters. For this reason, it can only be recommended
as a large-sample solution to the problem of splitting a test into
parallel halves. The same holds if the zero-one programming
formulation of Gulliksen's method given below is used with
statistics instead of parameters.

A Zero-one Programming Version of Gulliksen's Method

Gulliksen's method consists of two steps: pairing the items and
assigning items from pairs to test halves. Both tasks can be
performed using techniques from zero-one programming. Interest in
the application of zero-one programming techniques originated
recently from a paper by Theunissen (1985) who applied them to
solve the problem of automated test design in item response theory.
This problem is pursued further in Theunissen and Verstralen (1986)
and van der Linden and Boekkooi-Timminga (1986), whereas
Boekkooi-Timminga (1986a, 1986b) provides extensions to the
problems of simultaneous test design and the design of parallel
tests in item response theory. The techniques used below have a
close relationship to the ones in the last two references but are
applied in the context of classical test theory here, while use is
also made of the minimax approach in van der Linden and
Boekkooi-Timminga.




Pairs of Items 

In Gulliksen's method the items are paired on inspection. It is
suggested to replace this practice by the following unequivocal
criterion. In the graph the Euclidean distance

(4)  $\delta_{ij} = \left[(\pi_i - \pi_j)^2 + (\rho_{iX} - \rho_{jX})^2\right]^{1/2}$

between the points i and j ($i \neq j$) is considered. It is
proposed to pair the items such that the sum of the within-pair
distances is minimal. In the following, as is necessary in the
Gulliksen method, it is supposed that n is an even number. (If n is
odd, one item must be deleted and a Spearman-Brown correction with
factor n/(n-1) should be applied to the eventual reliability
estimate.) Let $x_{ij}$ be a binary decision variable denoting
whether i and j are a pair or not. That is,

$x_{ij} = \begin{cases} 1 & \text{if } i \text{ and } j \text{ are a pair,} \\ 0 & \text{otherwise,} \end{cases} \quad i < j.$



The problem is to decide on the n(n-1)/2 values of $x_{ij}$ such
that the criterion of a minimal sum of distances is met. Now the
product $\delta_{ij} x_{ij}$ is equal to the distance between i and
j if they are a pair, and to zero otherwise. The problem is thus to
minimize the sum of these products subject to the constraints that
each item has to be a member of exactly one pair. In the usual
zero-one programming format the problem is as follows:

(5)  minimize $\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \delta_{ij} x_{ij}$

subject to

(6)  $\sum_{i=1}^{j-1} x_{ij} + \sum_{i=j+1}^{n} x_{ji} = 1, \quad j = 1, \ldots, n$

(7)  $x_{ij} \in \{0, 1\}, \quad i = 1, \ldots, n-1; \; j = i+1, \ldots, n$



where for notational convenience the sums in (6) are equal to zero
if the upper bound to the index is smaller than the lower bound, or
conversely. The objective function in (5) is defined as the
minimization of the sum of all within-pair distances. The
constraints in (6) guarantee that for each item the decision
variables $x_{ij}$ (i < j) take the value 1 exactly once, which
means that each item arrives in exactly one pair. In (7) the
decision variables are constrained to be binary.

The problem in (5) - (7) is a standard zero-one programming
problem that is found in textbooks on linear programming (e.g.,
Wagner, 1975, chap. 13). Algorithms to solve the problem can be
found in the same references and have been implemented in various
computer programs. In the empirical example below, the program
LANDO is used which is based on the branch-and-bound method by Land
and Doig (1960). The output of the program is the n(n-1)/2 values
of the decision variables $x_{ij}$ with n/2 values equal to 1 and
the remaining ones to 0.
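For readers without access to a zero-one programming code, the pairing problem in (5) - (7) can be illustrated for small tests by an exhaustive recursive search. The sketch below is not the Land and Doig branch-and-bound method and not the LANDO program; it is a plain memoized enumeration swapped in for illustration. The function name `pair_items` and the 0-based item indices are assumptions.

```python
from functools import lru_cache
from math import hypot


def pair_items(pi, rho):
    """Return (total_distance, pairs) with a minimal sum of within-pair
    Euclidean distances; items are indexed 0, ..., n-1."""
    n = len(pi)
    if n % 2:
        raise ValueError("n must be even")

    def delta(i, j):
        # Distance between items i and j in the difficulty/correlation plot.
        return hypot(pi[i] - pi[j], rho[i] - rho[j])

    @lru_cache(maxsize=None)
    def best(unpaired):
        # unpaired is a bitmask of the items not yet placed in a pair.
        if unpaired == 0:
            return 0.0, ()
        i = (unpaired & -unpaired).bit_length() - 1  # lowest unpaired item
        rest = unpaired & ~(1 << i)
        result = None
        for j in range(i + 1, n):
            if rest >> j & 1:
                cost, pairs = best(rest & ~(1 << j))
                total = cost + delta(i, j)
                if result is None or total < result[0]:
                    result = (total, ((i, j),) + pairs)
        return result

    return best((1 << n) - 1)


# Four made-up items forming two obvious pairs.
total, pairs = pair_items([0.10, 0.90, 0.12, 0.88], [0.3, 0.5, 0.3, 0.5])
```

Because the lowest-numbered unpaired item is always paired first, every candidate solution satisfies the constraint that each item is a member of exactly one pair. The search grows quickly with n, so for longer tests a proper zero-one programming code remains the appropriate tool.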

Assigning Items to Components 

The optimization procedure could stop here to randomly assign items
from pairs to test halves, as is done in the Gulliksen method.
However, it is also possible to match the test halves further, for
instance, on their average scores or variances. In both cases the
problem is a zero-one programming problem again. If the latter
option is chosen, the problem is to match the test halves on their
sums of the terms $[\pi_i(1-\pi_i)]^{1/2} \rho_{iX}$. Since, by
definition, there are only two test halves, matching the two sums
is equivalent to minimizing the sum with the larger value.
Formulating the problem using this minimax criterion has the
advantage that it can easily be generalized to other splits than
test halves. This generalization will be shown below.

The output of the previous problem is a set of n/2 pairs. Let
(p,q) denote the p-th item (p = 1, 2) in the q-th pair (q = 1, ...,
n/2) and define a binary decision variable $x_{pqr}$ (r = 1, 2) as

(8)  $x_{pqr} = \begin{cases} 1 & \text{if item } (p,q) \text{ is assigned to test half } r, \\ 0 & \text{otherwise.} \end{cases}$
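The assignment problem can be formulated as

(9)  minimize z

subject to

(10)  $\sum_{p=1}^{2} \sum_{q=1}^{n/2} [\pi_{pq}(1-\pi_{pq})]^{1/2} \rho_{pq} x_{pqr} \le z, \quad r = 1, 2$

(11)  $\sum_{p=1}^{2} x_{pqr} = 1, \quad r = 1, 2; \; q = 1, \ldots, n/2$

(12)  $\sum_{r=1}^{2} x_{pqr} = 1, \quad p = 1, 2; \; q = 1, \ldots, n/2$

(13)  $x_{pqr} \in \{0, 1\}, \quad p = 1, 2; \; q = 1, \ldots, n/2; \; r = 1, 2$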



where $\pi_{pq}$ and $\rho_{pq}$ are the item difficulty and
discrimination indices for item (p,q). The constraints in (10)
ensure that the standard deviations of the two test halves are not
larger than the upper bound z minimized in (9). The constraint in
(11) requires that the items in a pair are assigned to different
test halves each consisting of n/2 items, whereas (12) requires
that each item is assigned exactly once. The constraints in (11) -
(12) could be simplified by replacing (8) by a variable $x_{pq}$
equal to 1 if (p,q) has to be assigned to the first test half and
equal to 0 otherwise, but then the generalization to splits other
than test halves to be presented below is not so obvious.
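For a small number of pairs the minimax assignment in (9) - (13) can be illustrated by enumerating all admissible assignments. The sketch below is an exhaustive search, not a zero-one programming algorithm; the function name `assign_halves` and the representation of the coefficients as a list `coef` indexed by item are assumptions. The first pair is kept in a fixed orientation because interchanging the two halves does not change the value of z.

```python
from itertools import product


def assign_halves(pairs, coef):
    """Return (z, half_1, half_2) minimizing the larger of the two
    coefficient sums, with one item of every pair in each half.

    pairs: list of (i, j) item pairs
    coef:  coef[i] is the coefficient of item i in constraint (10)
    """
    if not pairs:
        raise ValueError("at least one pair is required")
    best = None
    # Fix the orientation of the first pair; enumerate the others.
    for rest in product((0, 1), repeat=len(pairs) - 1):
        choice = (0,) + rest
        half_1 = [pair[k] for pair, k in zip(pairs, choice)]
        half_2 = [pair[1 - k] for pair, k in zip(pairs, choice)]
        z = max(sum(coef[i] for i in half_1),
                sum(coef[i] for i in half_2))
        if best is None or z < best[0]:
            best = (z, half_1, half_2)
    return best


# Two made-up pairs with made-up coefficients.
z, half_1, half_2 = assign_halves([(0, 1), (2, 3)], [0.5, 0.3, 0.2, 0.45])
```

Taking exactly one item of each pair for each half is what the constraints in (11) and (12) require, so only the 2^(n/2 - 1) distinct orientations of the pairs need to be examined.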

The same analysis could be done with $\pi_{pq}$ as coefficients in
(10), matching the test halves on their average scores, with
weighted combinations
$c\pi_{pq} + (1-c)[\pi_{pq}(1-\pi_{pq})]^{1/2}\rho_{pq}$,
$0 \le c \le 1$, as coefficients, or with inequality constraints on
the averages (variances) added to the model matching the test
halves on their variances (averages). All these options are due to
the fact that the underlying problem of matching test halves on
parallelness is one of multiple-objective decision making. The
wealth of choices does not need to bother us much, because the
previous pairing of the items already ensures a high match of the
test halves on both their averages and variances before they enter
this stage of optimization. In the empirical example below weighted
coefficients with c = .5 are used. This choice is in the same
spirit as the first stage in Gulliksen's method where in (4)
$\pi_i$ and $\rho_{iX}$ are also weighted equally.
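The weighted coefficients can be computed as in the following sketch. The function name `weighted_coefficients` is hypothetical, and the standard-deviation term assumes the classical expression $[\pi(1-\pi)]^{1/2}\rho$ for a dichotomous item.

```python
from math import sqrt


def weighted_coefficients(pi, rho, c=0.5):
    """Coefficients c*pi + (1-c)*sqrt(pi*(1-pi))*rho for each item.

    c = 1 matches the halves on their average scores only,
    c = 0 on their standard deviations only.
    """
    if not 0 <= c <= 1:
        raise ValueError("c must lie between 0 and 1")
    return [c * p + (1 - c) * sqrt(p * (1 - p)) * r
            for p, r in zip(pi, rho)]


# One made-up item with difficulty .5 and item-test correlation .4.
coef = weighted_coefficients([0.5], [0.4], c=0.5)
```

The resulting list can serve as the coefficients in the assignment model, one value per item.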

Generalization to Other Splits 

The above can easily be generalized to other splits than test 
halves. The case of a split into thirds, for instance, is modeled 
as follows. 
Triples of Items

It is assumed that n is a multiple of three. Then the within-triple
"distance" is defined as
$\delta_{ijk} = \delta_{ij} + \delta_{ik} + \delta_{jk}$ for all
triples (i,j,k), $i \neq j$, $j \neq k$ and $i \neq k$, and the
decision variable $x_{ijk}$ is equal to one only if i, j and k are
in the same triple and equal to zero otherwise (i < j < k).
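The problem is now

(14)  minimize $\sum_{i=1}^{n-2} \sum_{j=i+1}^{n-1} \sum_{k=j+1}^{n} \delta_{ijk} x_{ijk}$

subject to

(15)  $\sum_{j=2}^{k-1} \sum_{i=1}^{j-1} x_{ijk} + \sum_{i=1}^{k-1} \sum_{j=k+1}^{n} x_{ikj} + \sum_{j=k+2}^{n} \sum_{i=k+1}^{j-1} x_{kij} = 1, \quad k = 1, \ldots, n$

(16)  $x_{ijk} \in \{0, 1\}, \quad i = 1, \ldots, n-2; \; j = i+1, \ldots, n-1; \; k = j+1, \ldots, n$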



where for notational convenience undefined sums in (15) are put
equal to zero again. The values in the upper and lower bounds in
(15) follow from the requirement that $x_{ijk}$ be defined for
i < j < k only.
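The grouping into triples can again be illustrated for small tests by an exhaustive memoized search. As before, this is a sketch swapped in for a zero-one programming algorithm; the function name `group_triples` and the 0-based item indices are assumptions.

```python
from functools import lru_cache
from itertools import combinations
from math import hypot


def group_triples(pi, rho):
    """Return (total_distance, triples) with a minimal sum of
    within-triple distances; items are indexed 0, ..., n-1."""
    n = len(pi)
    if n % 3:
        raise ValueError("n must be a multiple of three")

    def delta(i, j):
        return hypot(pi[i] - pi[j], rho[i] - rho[j])

    @lru_cache(maxsize=None)
    def best(unassigned):
        # unassigned is a bitmask of the items not yet in a triple.
        if unassigned == 0:
            return 0.0, ()
        items = [i for i in range(n) if unassigned >> i & 1]
        i = items[0]  # the lowest unassigned item opens the next triple
        result = None
        for j, k in combinations(items[1:], 2):
            rest = unassigned & ~((1 << i) | (1 << j) | (1 << k))
            cost, triples = best(rest)
            # Within-triple distance: sum of the three pairwise distances.
            total = cost + delta(i, j) + delta(i, k) + delta(j, k)
            if result is None or total < result[0]:
                result = (total, ((i, j, k),) + triples)
        return result

    return best((1 << n) - 1)


# Six made-up items forming two obvious clusters.
total, triples = group_triples(
    [0.20, 0.80, 0.21, 0.79, 0.22, 0.81], [0.4] * 6
)
```

Each item enters exactly one triple because an item is removed from the set of unassigned items as soon as it is placed.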
Assigning Items to Components

If in (9) - (13) the indices run as follows

(17)  p = 1, 2, 3
      q = 1, ..., n/3
      r = 1, 2, 3

the model assigns items from triples to test components of size
n/3.
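A sketch of the assignment step for groups of any common size m (m = 3 for triples) is given below. It enumerates, for every group, the permutations that send its m items to m different components and keeps the assignment with the smallest maximum coefficient sum. The function name `assign_components` and the list `coef` of item coefficients are assumptions; the first group is kept in a fixed order because relabeling the components does not change z.

```python
from itertools import permutations, product


def assign_components(groups, coef):
    """Return (z, components) minimizing the largest coefficient sum,
    with one item of every group in each component.

    groups: list of equally sized tuples of item indices
    coef:   coef[i] is the coefficient of item i
    """
    if not groups:
        raise ValueError("at least one group is required")
    m = len(groups[0])
    identity = tuple(range(m))
    best = None
    for rest in product(permutations(range(m)), repeat=len(groups) - 1):
        components = [[] for _ in range(m)]
        for group, perm in zip(groups, (identity,) + rest):
            # perm[p] is the component receiving the p-th item of the group.
            for item, r in zip(group, perm):
                components[r].append(item)
        z = max(sum(coef[i] for i in comp) for comp in components)
        if best is None or z < best[0]:
            best = (z, components)
    return best


# Two made-up triples with made-up coefficients.
z, components = assign_components(
    [(0, 1, 2), (3, 4, 5)], [0.1, 0.2, 0.3, 0.3, 0.2, 0.1]
)
```

The number of assignments examined is (m!) raised to the number of groups minus one, so the enumeration is only meant for small illustrations.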

Conclusion 

The above immediately suggests how the model can be generalized to
splits into test components of any length.



An Empirical Example

In order to illustrate the procedures, the algorithm by Land and
Doig (1960) as implemented in the program LANDO was used together
with the item data in Figure 1. The item difficulties and item-test
correlations were estimated from a sample of 5418 subjects which is
large enough to prevent capitalizing on chance in the Gulliksen
method. The estimates are presented in Table 1.

Insert Table 1 about here
As was clear from the bivariate distribution of the estimates in
Figure 1, it is not immediately obvious how all these items should
be paired by hand. Table 2 gives the optimal item pairs following
(5) - (7). The results of the assignment of the items to test
halves according to the optimization model in (9) - (13), with as
coefficients in (10) the equally weighted sum of (2) and (3), are
indicated in Table 2 by underscoring the items in the same test
half.

Insert Table 2 about here

The results convincingly demonstrate the advantage of optimal
assignment over the random assignment that takes place in the
original version of the Gulliksen method. For some pairs (e.g.,
2-6, 5-15 and 16-17) the within-pair distance is still large in
spite of the fact that the pairing was optimal. This implies that
there is much room for further optimization. Random assignment
makes no systematic use of this but the optimization model in (9) -
(13) automatically selects from all possible assignments the one
that matches the test halves most closely.
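The remark about the large within-pair distances can be checked directly from the estimates in Table 1 and the pairs in Table 2. The sketch below assumes that the second block of Table 1 lists items 11 to 20 in order; the names `TABLE_1`, `TABLE_2_PAIRS` and `pair_distance` are introduced for illustration only.

```python
from math import hypot

# Table 1 estimates: item number -> (difficulty, item-test correlation).
# Assumption: the right-hand block of the table holds items 11 to 20.
TABLE_1 = {
    1: (.85, .39), 2: (.50, .41), 3: (.60, .40), 4: (.66, .60),
    5: (.87, .25), 6: (.28, .37), 7: (.87, .40), 8: (.48, .48),
    9: (.74, .47), 10: (.65, .60), 11: (.83, .52), 12: (.68, .54),
    13: (.80, .43), 14: (.84, .45), 15: (.86, .34), 16: (.52, .47),
    17: (.62, .58), 18: (.61, .40), 19: (.51, .48), 20: (.66, .58),
}

# Optimal pairs as reported in Table 2.
TABLE_2_PAIRS = [(1, 7), (2, 6), (3, 18), (4, 10), (5, 15),
                 (8, 19), (9, 13), (11, 14), (12, 20), (16, 17)]


def pair_distance(i, j, table=TABLE_1):
    """Euclidean distance between items i and j in the Gulliksen plot."""
    (pi_i, rho_i), (pi_j, rho_j) = table[i], table[j]
    return hypot(pi_i - pi_j, rho_i - rho_j)


distances = {pair: pair_distance(*pair) for pair in TABLE_2_PAIRS}
largest_three = sorted(distances, key=distances.get, reverse=True)[:3]
```

Under the stated assumption the three largest distances belong to the pairs 2-6, 16-17 and 5-15, which are the pairs singled out in the text.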

Discussion 

The idea to estimate lower bounds to the reliability coefficient
from a single test administration is not uniquely associated with
Cronbach's coefficient $\alpha$. It should be recalled that other
lower bounds to the reliability coefficient exist. One example is
Raju's coefficient $\beta$ already referred to earlier. Raju (1982)
offers some theory on the theoretical maximum of coefficient
$\alpha$ under fixed variance of the test scores. Older examples
can be found in Guttman (1945), whereas Bentler and Woodward (1980)
(see also ten Berge, Snijders, and Zegers, 1981) derive a whole
chain of lower bounds. The idea of maximizing a lower bound is also
present in Krammer and van der Linden (1986) who maximize the
squared validity coefficient as a lower bound to the reliability
coefficient across a set of linear combinations of external
variables. All these approaches have different strong and weak
points and require more or less intensive computations. It is not
the purpose of this paper to replace them by Gulliksen's method.
Its main intention is to give this method, which has already gained
its place in test theory and practice, a sound computational basis.
Besides, the same zero-one programming method can be used in any
other situation where classically parallel tests are needed, e.g.,
for use in pretest-posttest designs in educational research or in
testing situations where a secrecy problem exists.
Table 1

Difficulty and Discrimination Values for the Twenty-item Test

Item   $\pi_i$   $\rho_{iX}$      Item   $\pi_i$   $\rho_{iX}$

  1      .85       .39             11      .83       .52
  2      .50       .41             12      .68       .54
  3      .60       .40             13      .80       .43
  4      .66       .60             14      .84       .45
  5      .87       .25             15      .86       .34
  6      .28       .37             16      .52       .47
  7      .87       .40             17      .62       .58
  8      .48       .48             18      .61       .40
  9      .74       .47             19      .51       .48
 10      .65       .60             20      .66       .58
Table 2

Optimal Item Pairs and Test Halves

_1_ - 7        8 - 19
_2_ - 6       _9_ - 13
_3_ - 18      11 - 14
4 - _10_      12 - 20
5 - 15        16 - 17

Note. Underscored item numbers in same test half.
Figure 1. The Gulliksen plot for a twenty-item test.